Skip to content

A61 update: Avoid re-resolution without any connection failures#539

Open
ejona86 wants to merge 2 commits intogrpc:masterfrom
ejona86:dualstack-pf-reduce-refresh
Open

A61 update: Avoid re-resolution without any connection failures#539
ejona86 wants to merge 2 commits intogrpc:masterfrom
ejona86:dualstack-pf-reduce-refresh

Conversation

@ejona86
Copy link
Copy Markdown
Member

@ejona86 ejona86 commented Mar 6, 2026

This new behavior we will be guaranteed not to request re-resolution more often every $NUM_SUBCHANNEL subchannel failures.

Some "initial pass" references were changed to "first pass" to use consistent language.


This change has limited impact due to DNS's limit of 1 query per 30 seconds. But it will prevent noop updates from causing extra DNS resolutions. This will be low priority to implement in languages, although in Java there's benefit in implementing it now to help when our "match pre-dualstack behavior" flag is enabled.

This new behavior we will be guaranteed not to request re-resolution
more often every $NUM_SUBCHANNEL subchannel failures.

Some "initial pass" references were changed to "first pass" to use
consistent language.
Comment on lines 133 to 156
@@ -143,18 +147,18 @@ all times, with no regard for the order of the addresses. Each
individual subchannel will provide [backoff behavior][backoff-spec],
reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
has finished. The pick_first policy will therefore automatically
request a connection whenever a subchannel reports IDLE. We will count
the number of connection failures, and when that number reaches the
number of subchannels, we will request re-resolution; note that because
request a connection whenever a subchannel reports IDLE. When the
number of connection failures reaches the number of subchannels (and not
in the first pass), we will request re-resolution; note that because
the backoff state will differ across the subchannels, this may mean that
we have seen multiple failures of a single subchannel and no failures
from another subchannel, but this is a close enough approximation and
very simple to implement.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be clearer to decouple the discussion of re-resolution behavior from the rest of the description here by putting it in its own paragraph:

Suggested change
- We will wait for at least one connection attempt on every address to
fail before we consider the first pass to be complete. As per
[gRFC A62][A62], we will report TRANSIENT_FAILURE state and will
continue trying to connect. We will stay in TRANSIENT_FAILURE until
either (a) we become connected or (b) the LB policy is destroyed by the
channel shutting down or going IDLE.
If the first pass completes without a successful connection attempt, we
will switch to a mode where we keep trying to connect to all addresses at
all times, with no regard for the order of the addresses. Each
individual subchannel will provide [backoff behavior][backoff-spec],
reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
has finished. The pick_first policy will therefore automatically request a
connection whenever a subchannel reports IDLE.
Each time a subchannel reports TRANSIENT_FAILURE, we will increase a
counter for the number of connection failures. The counter is reset any time
re-resolution is requested. When the number of connection failures reaches
the number of subchannels (and not in the first pass), we will request
re-resolution; note that because the backoff state will differ across the
subchannels, this may mean that we have seen multiple failures of a single
subchannel and no failures from another subchannel, but this is a close
enough approximation and very simple to implement.

Copy link
Copy Markdown
Member Author

@ejona86 ejona86 Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your use of the github "suggestion" feature is broken. You might want to avoid using it next time or properly select the multiple lines necessary to make it useful. (Or... I don't even know. I see you selected multiple lines now. I have no clue what github is doing. Although the multiple lines selected still looks suspect.)

That seems strictly worse to me, as it needs to alternate between "first pass" and "not first pass". It's also not as clear whether this paragraph is a continuation of the "not first pass" paragraph above; in fact, reading what you wrote I'd assume it is a continuation and only applies when we swap "modes", because the "(and not in the first pass)" is a parenthetical.

I don't understand why "note that because the backoff" is combined with a semicolon with the previous sentence. They seem unrelated.

It looks like your new reorganization broke the language, because now we're not requesting re-resolution when we leave the first pass. Was that your intention? I can't actually tell your intention with the language/organization present.

Comment on lines +151 to +152
number of connection failures reaches the number of subchannels (and not
in the first pass), we will request re-resolution; note that because
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in the first pass? If we're going to trigger this based solely on the number of connection attempt failures, then shouldn't we do that regardless of whether or not we're in the first pass? It seems like if we're already in TF and we get a new address list, we shouldn't effectively reset the counter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think I changed any semantics here. I just made it more explicit. This paragraph only applies when we swap "modes" away from first pass.

I don't think we need to re-resolve during the first pass as we haven't actually tried all the addresses yet. We trigger re-resolution after that completes. If we are doing a first pass, we're not guaranteed to be in TF...

I could understand basing some of this off of "in TF" vs "not in TF," but the existing design uses "in first pass" vs "not in first pass." So changes there are more invasive.

@ejona86 ejona86 requested a review from markdroth March 12, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants